Abstract

Acquisition of new proteins by viruses usually 
occurs through horizontal gene transfer or through gene 
duplication, but another, less common mechanism is the 
usage of completely or partially overlapping reading 
frames. A case of acquisition of a completely new protein 
through introduction of a start codon in an alternative 
reading frame is the protein encoded by open reading frame 
(orf) 9b of SARS coronavirus. This gene completely 
overlaps with the nucleocapsid (N) gene (orf9a). Our 
findings indicate that the orf9b gene features a discordant 
codon-usage pattern. We analyzed the evolution of orf9b in 
concert with orf9a using sequence data of betacoronavirus- 
lineage b and found that orf9b, which encodes the over- 
printing protein, evolved largely independent of the over- 
printed orf9a. We also examined the protein products of 
these genomic sequences for their structural flexibility and 
found that it is not necessary for a newly acquired, overlapping
protein product to be intrinsically disordered, in 
contrast to earlier suggestions. Our findings contribute to 
characterizing sequence properties of newly acquired genes 
making use of overlapping reading frames. 
Introduction 


A codon is composed of three nucleotides and therefore, 
three reading frames are, in principle, possible within the 
same gene. The correct reading frame is determined by the 
start codon, usually ATG or, in RNA genomes, AUG. 
Occasionally, the same stretch of genome codes for more 
than one protein in different reading frames; in this case, the 
protein products are called “overprinted” (or “ancestral”) 
and “overprinting” (or “novel”) [1]. For simplicity, proteins 
encoded by overlapping genes are often called “overlapping 
proteins” or “transframe proteins” [2], because during their 
translation, there occurs either a +1 or -1 shift in the 
reading frame [3]. Overlapping reading frames have been 
discovered in many organisms, but they are most commonly 
found in viruses because of their comparatively small-size 
genomes [3-7]. It has been proposed that translation of 
overlapping reading frames in RNA viruses occurs via leaky 
ribosomal scanning [8-13], ribosomal frameshifting [14-17], 
or stop-codon readthrough [18-20]. Recently, a -2 
programmed ribosomal frameshift has been observed in the 
synthesis of an overlapping protein in members of the family 
Arteriviridae [21, 22]. 

In terms of evolution, overlapping genes are considered 
a mechanism for creating novel proteins [1, 5]. However, 
any point mutation occurring in an overlapping gene region 
affects two (or more) protein products at the same time. 
Fig. 1 a Schematic organization of the SARS-coronavirus genome, 
highlighting structural and accessory genes. The overprinting accessory 
genes are indicated below their overprinted mates. b Overlapping 
N and orf9b genes of SARS-CoV with their start-stop coordinates 
within the genome (“end” coordinates include the stop codon) 


Several studies have been performed to show the evolu- 
tionary aspects of overlapping gene sets in viruses [4, 7, 
23-28]. Here, we report a case study on a set of overlap- 
ping proteins in severe acute respiratory syndrome coro- 
navirus (SARS-CoV). SARS is characterized by acute 
pulmonary inflammation with a case/fatality rate of about 
10 % [29, 30]. SARS-CoV first emerged in 2002 in 
Guangdong province, China, and developed into a wide- 
spread epidemic in the spring of 2003 [29]. It is believed 
that the virus originated from a bat reservoir [31-33]. 
Although the epidemic ended in July 2003, it is not unlikely 
that SARS-CoV or a similar virus will resurface. 
This view is supported by the recent emergence of a new 
human betacoronavirus of lineage c, Middle-East respiratory 
syndrome (MERS) CoV [34-37], which is transmitted 
to humans from dromedary camels [38-40] but may have a 
bat origin as well [41-43]. As of October 16, 2014, 877 
laboratory-confirmed human MERS cases have been 
reported since September 2012, including at least 317 
deaths [44]. 

The 30-kb single-stranded SARS-CoV RNA genome 
codes for at least 28 proteins. In the 3’-proximal third, 
small open reading frames (orfs) coding for the so-called 
accessory proteins are interspersed among the genes coding 
for the structural proteins [45, 46]. These include orf3a/b, 
orf6, orf7a/b, orf8a/b, orf9b, and possibly orf9c (Fig. 1) 
[45-49]. Accessory proteins of SARS-CoV are thought to 
be important players involved in viral pathogenicity [47- 
50]. However, reverse genetic studies have demonstrated 
that these proteins are not required for viral replication in 
cell culture or transgenic mice expressing human ACE2, 
the receptor for SARS-CoV [51-53]. Here we present the 
results of an investigation into the evolution and 
differential selection affecting the orf9b protein. This gene 
completely overlaps with the gene coding for the 
422-amino-acid residue nucleocapsid (N) protein (Fig. 1). 
Recently, it has been proposed that leaky ribosomal scan- 
ning is responsible for the production of the overlapping 
orf9b protein [54]. Orf9b codes for a small protein of 98 
amino-acid residues, which is found in SARS-CoV-infec- 
ted cells [55]. Antibodies against this protein have been 
detected in the sera of SARS patients, demonstrating that 
the protein is produced during infection [50, 56]. There has 
also been a report of yet another overlapping gene, orf9c, 
within the nucleocapsid gene, coding for a predicted pro- 
tein of 70 amino-acid residues [45, 46] (see Fig. 1). Until 
now, no evidence of orf9c expression has become available 
and since it is not well annotated, we will not include this 
gene in our analysis. 

In case of the overlapping nucleocapsid and orf9b genes 
of SARS-CoV, we are in the fortunate and rare situation 
that a large body of sequence data is available, as many 
isolates of the virus and its relatives from civets and bats 
were analyzed during and after the outbreak of 2003 [57]. 
In addition, crystal structures have been determined for 
both the overlapping proteins [58, 59], allowing conclu- 
sions that would not be possible in most cases of over- 
lapping viral genes. In order to shed light onto the 
evolution of orf9b in concert with the nucleocapsid gene, 
we analyzed the codon-usage patterns of the two genes and 
the effects of the overlap on their mutation rates as well as 
on the three-dimensional structures of their protein 
products. 


Methods 


Seventy full-length genomic sequences including one ref- 
erence sequence of SARS-CoV (isolate Tor2) were 
retrieved from the GenBank database 
(http://www.ncbi.nlm.nih.gov/Genbank/index.html). Among these, 37 were 
from human SARS-CoV isolates, 15 from civet SARS-CoV, 
and 18 (including the newly discovered SL-CoV-WIV1 [33]) 
from bat betacoronaviruses lineage b. Accession 
numbers are given in Table S1 (Online Resource 1). 
Unless explicitly stated otherwise, all these viruses are 
collectively called “SARS-CoV” in this work. These full- 
length genomic sequences were parsed and corresponding 
gene and amino-acid sequences were collected in a local 
database for further analysis. 


Codon-usage analysis 


Codon-usage analysis was done using the “sequence analysis 
program” within the Sequence Manipulation Suite 
[60]. This program accepts one or more nucleotide 
sequence(s) and returns the number and frequency of each 
codon type. Relative Synonymous Codon Usage (RSCU) 
values were calculated using the frequencies of each codon 
type of a nucleotide sequence. The RSCU value for a 
codon i is calculated by: RSCU(i) = (observed i)/ 
(expected i), where “observed i” is the observed frequency 
of codon i in a gene and “expected i” is the frequency 
expected assuming equal usage of synonymous codons for 
an amino acid in a gene. 
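
The RSCU definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Sequence Manipulation Suite implementation: the function name `rscu` is hypothetical, the standard genetic code is assumed, stop codons are skipped, and amino acids that do not occur in the sequence are simply left out of the result.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds):
    """Return {codon: RSCU} for an in-frame coding sequence (DNA or RNA)."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    values = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue                           # stop codons are not scored
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue                           # amino acid absent from this gene
        expected = total / len(family)         # equal usage of synonymous codons
        values[codon] = counts[codon] / expected
    return values


# Toy input: four proline codons, three of them CCC.
print(rscu("CCCCCCCCCCCA"))
```

With equal usage every codon of a family would score 1; values above 1 mark preferred codons and values below 1 mark avoided ones.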

The RSCU index is a measure to assess whether a 
sequence shows a preference for particular synonymous 
codons, i.e., codons that code for the same amino acid. The 
comparison was done at the level of codon usage between 
overlapping and non-overlapping coding regions. The 
RSCU values obtained from the overlapping orf9a/9b genes 
were then compared with those of the non-overlapping 
regions of the same viral genome by means of the Pearson 
correlation coefficient “r” [61]. Values of r range from -1 to 
+1, reflecting a completely different and a completely identical 
usage of synonymous codons, respectively. Discordant 
usage suggests the gene to be relatively new [62, 63]. 
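
As a rough illustration of this comparison, the Pearson coefficient between two RSCU profiles could be computed as below. The function names and the choice to correlate only codons present in both profiles are assumptions made for this sketch; the text does not state how codons missing from one gene were handled.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def rscu_correlation(rscu_a, rscu_b):
    """Correlate two {codon: RSCU} dicts over the codons they share (assumption)."""
    shared = sorted(set(rscu_a) & set(rscu_b))
    return pearson_r([rscu_a[c] for c in shared], [rscu_b[c] for c in shared])


# Toy profiles with exactly opposite proline-codon preferences.
gene = {"CCC": 2.0, "CCA": 1.0, "CCT": 0.0}
rest = {"CCC": 0.0, "CCA": 1.0, "CCT": 2.0}
print(rscu_correlation(gene, rest))
```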


Mutation rate analysis 


Mutation rate analysis was performed by first aligning both 
the nucleotide and protein sequences using ClustalX ver- 
sion 2.1 [64]. Redundant sequences (from multiple human 
patients) were manually removed. DnaSP 5.10 [65] was 
used to calculate the number of synonymous and nonsyn- 
onymous substitutions at the overlapping gene regions of 
the N gene and the orf9b gene. 


Entropy-plotting of alignments 


Entropy-plotting of alignments was done to determine the 
variations occurring in the overprinted N protein and the 
overprinting orf9b protein. Variation in the three sets of 
nucleotide sites of a codon and the variation in the corre- 
sponding amino-acid residues in the overlapping proteins 
were studied by plotting the entropy (variability) of the 
aligned overlapping nucleotide and protein sequences, as 
implemented in the BioEdit software v7.0.0. [66]. The 
entropy is defined as a measure of uncertainty at each 
position in a set of aligned nucleotide or protein sequences 
[66]. The cumulative entropy is the sum of all the entropy 
values calculated at each position in a sequence. For cal- 
culating the entropy, sequences are treated as a matrix of 
characters and the maximum number of different characters 
found in a column (column of aligned sets of nucleotide or 
protein sequences) defines the maximum total uncertainty or 
the “entropy” [66]. The entropy H is calculated by 
$H(l) = -\sum_{b} f(b,l)\,\ln f(b,l)$, where H(l) is the uncertainty (entropy) at 
position l, b represents a residue (out of the allowed choices 
for the sequence under investigation), and f(b,l) is the frequency 
at which residue b is found at position l. We determined 
the frequency of substitutions at each codon site in the 
overlapping region of the nucleocapsid/orf9b gene to find 
out the evolutionary strategy followed by this set of over- 
lapping genes in SARS-CoV. There are 98 codons in the 
overlapping region and we studied the variation of nucleo- 
tides (294 codon positions) and their corresponding amino 
acids in the overlapping nucleocapsid and orf9b gene region 
of 70 SARS-CoV genomes. 
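
The per-position entropy and the cumulative entropy defined above can be sketched as follows. This is a minimal illustration of the formula, not the BioEdit code: the function names are hypothetical, and gap characters are treated like any other symbol, which is an assumption.

```python
import math
from collections import Counter


def column_entropy(column):
    """H(l) = -sum_b f(b,l) ln f(b,l) for one alignment column."""
    n = len(column)
    counts = Counter(column)
    return sum(-(k / n) * math.log(k / n) for k in counts.values())


def entropy_profile(aligned):
    """Entropy at every position of equally long aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy(col) for col in zip(*aligned)]


def cumulative_entropy(aligned):
    """Sum of the per-position entropies over the whole alignment."""
    return sum(entropy_profile(aligned))


# Toy alignment: only the first column varies (two A, two C).
toy = ["AAG", "AAG", "CAG", "CAG"]
print(entropy_profile(toy), cumulative_entropy(toy))
```

A fully conserved column contributes 0, so the cumulative value grows only with variable positions.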


Inspection of crystal structures and prediction 
of disorder 


Inspection and presentation of crystal structures of the 
N-terminal domain (NTD) of the N protein (orf9a) [58] and 
the orf9b protein of SARS-CoV [59] were performed using 
the program PyMOL (Schrödinger), in order to assess possible 
consequences of the overlapping genes for the three- 
dimensional structures of the protein products. Disorder 
prediction was also carried out for these overlapping pro- 
teins because crystallographic information is not available 
for the N-terminal 46 residues of the overprinted N protein. 
For this purpose, the DisProt VSL2 intrinsic-disorder pre- 
diction program was used [67]. 


Results 
Codon usage in the overlapping gene set 


The first step of our analysis was to establish a relationship 
between the codon usage in overlapping and non-overlap- 
ping genes in the betacoronavirus SARS-CoV. Two-thirds 
of the SARS-CoV genome comprise orf1ab, which encodes 
the viral polyproteins pp1a and pp1ab. The 3’-proximal 
third comprises orfs encoding the structural proteins, i.e., 
spike (S), envelope (E), membrane (M), and nucleocapsid 
(N), as well as several accessory proteins as described 
previously [45, 46]. The non-overlapping regions of the 
genome (Fig. 1) were combined into a single unit including 
the orf1a, orf1b, spike, envelope, membrane, and orf6 
genes. On the other hand, the genes under study here, full- 
length nucleocapsid and the overlapping orf9b, were con- 
sidered distinct sets of data. The remaining accessory 
proteins were not included in this comparison because they 
contain partially overlapping regions. 

The subsequent correlation analysis (Table 1) shows 
that the internal overlapping gene (orf9b) uses synonymous 
codons very differently from the non-overlapping gene set 
of SARS-CoV, with 
an r value of -0.01. For example, of the eight proline 
residues in the orf9b protein, five (63 %) are coded by 
Table 1 Correlation between the codon-usage patterns of the overlapping N and its overprinting internal genes of SARS-CoV, as well as of 
other members of the genus Betacoronavirus, i.e., BCoV, MHV, and MERS-CoV, each with the non-overlapping coding regions in their genome 

Betacoronavirus   Proteins
SARS-CoV          Nucleocapsid
                  Orf9b
BCoV              Nucleocapsid
                  internal protein
MHV               Nucleocapsid
                  internal protein
MERS-CoV          Nucleocapsid
                  hypothetical internal protein



Table 2 Synonymous and nonsynonymous substitutions in overlapping 
and non-overlapping regions of the SARS-CoV nucleocapsid 
gene and in the orf9b gene 

Gene regions of:                       Ka     Ks     Ka/Ks = ω
Nucleocapsid (overlapping part)        0.41   0.73   0.56
Nucleocapsid (non-overlapping part)    0.37   0.59   0.62
Orf9b                                  0.53   0.43   1.23


CCC, whereas in the non-overlapping gene set, less than 
9 % of proline residues use this codon. A much higher 
degree of concordant relationship is seen between the 
overprinted N gene of SARS-CoV and the non-overlapping 
gene set (r value of 0.62). 

Codon-usage analysis for other betacoronaviruses con- 
taining a hypothetical overlapping “internal” gene within 
their nucleocapsid gene was also carried out for comparison. 
The nucleocapsid genes of Bovine Coronavirus (BCoV, 
NCBI accession number NC_003045), Mouse Hepatitis 
Virus (MHV, AC_000192), and Middle-East Respiratory 
Syndrome coronavirus (MERS-CoV, NC_019843) display a 
similar positive correlation when compared to the rest of the 
genome, with r values of 0.67, 0.66, and 0.57, respectively, 
whereas their corresponding “internal” genes have r values 
of -0.11, 0.00, and -0.13, respectively (Table 1). 


Effect of the overlap on the mutation rate in the N 
and orf9b genes and evolutionary strategy adopted 
by this overlapping gene set 


The ratio ω of nonsynonymous (Ka) to synonymous (Ks) 
nucleotide substitution rates is an indicator of selective 
pressure on genes. A ratio significantly greater than 1 
indicates positive selective pressure. A ratio around 1 
indicates either neutral evolution at the protein level or an 
averaging of sites under positive and negative selective 
pressure. A ratio less than 1 indicates pressure to conserve 
protein sequence, i.e., “purifying selection” [68]. 
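
As a quick arithmetic check of this ratio, the snippet below recomputes ω from the values printed in Table 2. It assumes that the first two numeric columns of Table 2 are Ka and Ks in that order, which is consistent with the printed ratios; the dictionary and function names are illustrative only. Comparing ω with 1 in this way is not a significance test.

```python
# (Ka, Ks) pairs as printed in Table 2; column order is an assumption.
TABLE_2 = {
    "Nucleocapsid (overlapping part)": (0.41, 0.73),
    "Nucleocapsid (non-overlapping part)": (0.37, 0.59),
    "Orf9b": (0.53, 0.43),
}


def omega(ka, ks):
    """Ratio of nonsynonymous to synonymous substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive")
    return ka / ks


ratios = {region: omega(ka, ks) for region, (ka, ks) in TABLE_2.items()}
for region, w in ratios.items():
    side = "above 1" if w > 1 else "below 1"
    print(f"{region}: omega = {w:.3f} ({side})")
```

Small differences from the printed ratios are expected because the tabulated Ka and Ks values are themselves rounded.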


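Table 1 (values) Number of amino-acid residues and correlation coefficient (r) for each protein

Betacoronavirus   Protein                         Number of amino-acid residues   Correlation coefficient (r)
SARS-CoV          Nucleocapsid                    422                             0.62
                  Orf9b                           98                              -0.01
BCoV              Nucleocapsid                    448                             0.67
                  internal protein                207                             -0.11
MHV               Nucleocapsid                    455                             0.66
                  internal protein                136                             0.00
MERS-CoV          Nucleocapsid                    All                             0.58
                  hypothetical internal protein   112                             -0.13
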

The overlapping gene regions of nucleocapsid and orf9b 
show differences in the evolutionary rates. The overprinted 
region of the nucleocapsid gene in SARS-CoV has a Ka/Ks 
ratio of 0.56 (the Ka/Ks ratio is 0.62 for the non-overlapping 
part of the N gene). On the other hand, the orf9b gene 
has a Ka/Ks ratio greater than 1 (Table 2), which means 
that the orf9b protein is subject to positive selection pressure 
and is evolving at a faster rate, compared to the 
overprinted N gene. Remarkably, due to the difference in 
frame phase (see below), the same stretch of genome has 
thus different evolutionary rates when coding for each of 
the two different proteins. This observation prompted us to 
analyze the nucleotide variations at each of the three 
nucleotide positions of the codons. 

Upon a point mutation, the position at which nucleotide 
substitution occurs within a codon reflects whether the 
substitution would be synonymous or not. When there is a 
nucleotide substitution at the first codon position, it causes 
an amino-acid change in 60 out of 64 cases. The four 
exceptions occur because of the partial degeneration of 
codons at this position. At this codon site, there are four 
possibilities, which would result in a synonymous substi- 
tution: two each in codons for leucine (UUA vs. CUA, 
UUG vs. CUG) and arginine (AGA vs. CGA, AGG vs. 
CGG). When there is a nucleotide substitution at the sec- 
ond codon position, it results in an amino-acid change in 63 
out of 64 cases. At this codon site, the only substitution that 
is synonymous occurs in the stop codons UAA versus 
UGA. Lastly, a nucleotide substitution occurring at the 
third codon position causes a change in amino acid in only 
16 out of 64 cases because of codon degeneracy at the third 
nucleotide position [25, 62, 63]. 
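
The position dependence described above can be explored by enumerating every single-nucleotide substitution in the standard genetic code. The sketch below is an illustration with hypothetical names; it counts directed substitutions (192 per codon position, i.e., 64 codons times 3 alternative bases) and treats stop-to-stop changes as synonymous, so its tallies are not on the same scale as the per-64 figures quoted in the text.

```python
from itertools import product

# Standard genetic code in RNA letters, codons enumerated in UCAG order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_by_position():
    """Count synonymous single-nucleotide substitutions per codon position."""
    tally = {1: 0, 2: 0, 3: 0}
    for codon, aa in CODE.items():
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == aa:      # same amino acid (or stop to stop)
                    tally[pos + 1] += 1
    return tally


print(synonymous_by_position())  # out of 192 possible substitutions per position
```

The first-position hits are the leucine and arginine pairs named above, the second-position hits are UAA versus UGA, and the large majority fall on the third position.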

As a result of leaky ribosomal scanning [54], translation 
of orf9b begins at the 10th nucleotide position of the N 
gene (Fig. 2a), resulting in a +1 difference in reading 
frame between orf9b ((+1)-phase frame) and the N gene 
(0-phase frame). Thus, the first nucleotide position of an N 
codon corresponds to the third nucleotide position of an 
Fig. 2 a Top: The 5’ ends of the SARS-CoV N and orf9b genes. At 
nucleotide no. 10 of the N gene, translation of the overlapping orf9b 
gene begins, resulting in a phase difference of +1 for this gene 
relative to the N gene. Bottom: Codon-site substitutions in the two 
genes. Three types of substitution have to be distinguished: N2/9b1, 
N3/9b2, N1/9b3. b Variation of three sets of nucleotides (in magenta): 
N1/9b3, N2/9b1, and N3/9b2, in relation to the amino-acid variations 
(in blue) in the overlapping nucleocapsid and orf9b proteins. The 
x-axis represents the codon sites in case of graphs 1, 3, and 5, i.e., 
nucleotide variations, whereas in case of graphs 2 and 4, the x-axis 
represents the amino-acid residue number. Note that the N protein 
overlaps with orf9b between its residues 4 and 101; however, in 
graph 2, which represents the amino-acid variations in the N protein, 
the x-axis is calibrated from 1 to 98 in order to facilitate the 
comparison with orf9b. The y-axis represents entropy. The green dot 
indicates the one case of synonymous N1/9b3 substitution that does 
not lead to an amino-acid exchange in the N protein because of the 
partial degeneration of the first nucleotide position in a codon (AGA 
and CGA both code for Arg). The red dot indicates a case of a two- 
nucleotide difference as a result of an N1/9b3 and an N2/9b1 
substitution that leads to an amino-acid exchange in the N protein. All 
bat betacoronaviruses of lineage b (with the exception of SL-CoV 
WIV1 [33]) have Lys at this position, whereas all civet and human 
SARS-CoV isolates as well as bat SL-CoV WIV1 have Pro (see text) 


orf9b codon (N1/9b3), the second position in an N codon 
corresponds to the first nucleotide position in an orf9b 
codon (N2/9b1), and the third position in an N codon 
corresponds to the second nucleotide position in an orf9b 
codon (N3/9b2) (Fig. 2a). Hence, as described above, a 
nucleotide substitution in the first position of a nucleocapsid 
codon (N1/9b3) is likely to cause an amino-acid 
change in N, but not in orf9b, whereas substitutions in the 
third position of an N codon (N3/9b2) are probably 
nonsynonymous in orf9b, but synonymous in N. 
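
The correspondence between codon positions in the two frames is simple modular arithmetic. The sketch below assumes, as the N2/9b1 labelling implies, that the overprinting frame begins one nucleotide downstream of an N codon boundary (`shift=1`); the function names are illustrative.

```python
def shifted_position(n_position, shift=1):
    """Codon position (1-3) in the shifted frame for a nucleotide that sits
    at codon position n_position (1-3) in the N frame."""
    if n_position not in (1, 2, 3):
        raise ValueError("codon positions are 1, 2 or 3")
    return (n_position - 1 - shift) % 3 + 1


def site_labels(shift=1):
    """Labels of the three site classes, e.g. 'N1/9b3'."""
    return [f"N{p}/9b{shifted_position(p, shift)}" for p in (1, 2, 3)]


print(site_labels())
```

Because the third codon position is the most degenerate, the sites that are third positions in one frame are the ones most likely to change silently in that frame while altering the other protein.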

The nucleotide variation at the 98 N1/9b3, N2/9b1, and 
N3/9b2 positions (see Fig. 2a) of the overlapping N/orf9b 
gene region for all 70 sequences is shown in Fig. 2b. We 
find that the value of cumulative mutational frequency 
($\sum H$; see “Methods” section) of the overlapping region 
of the nucleocapsid protein is 4.7 and that of the orf9b 
protein is 15.3. This difference in the frequency of mutation 
was somewhat expected, based on the different ω 
values for the two genes (see Table 2). Moving on to the 
nucleotide level, we obtain cumulative entropy values of 
3.16, 3.49, and 5.44 for the N1/9b3, N2/9b1, and N3/9b2 
codon positions, respectively. The graphs are calibrated in 
the range of 0-1 for accurate comparison of the results of 
protein sequences with nucleotide sequences (Fig. 2b). 

The higher rate of amino-acid variation in the orf9b 
protein is largely determined by nucleotide substitutions at 
the N3/9b2 sites. All the N3/9b2 nucleotide variations 
translate to amino-acid changes in the orf9b protein but are 
silent in the nucleocapsid protein. Amino-acid variations in 
the nucleocapsid protein are determined by substitutions of 
N1/9b3 nucleotides. In one instance, an N1/9b3 nucleotide 
variation results in a synonymous mutation in the nucleo- 
capsid protein (Fig. 2b, green dot). The amino acid at this 
position is arginine and this phenomenon occurs due to 
partial degeneration of the first nucleotide position as 
explained above. 

There are a few N2/9b1 mutations that impose amino- 
acid variations in both the proteins. An interesting varia- 
tion, corresponding to a concomitant N1/9b3 and N2/9b1 
exchange, results in an amino-acid difference at position 81 
of the nucleocapsid protein, within its well-ordered and 
overprinted part. All known genomic sequences of bat 
betacoronaviruses of lineage b feature the AAA triplet 
(coding for Lys) here, whereas all isolates of civet and 
human SARS-CoV have CCA (coding for Pro). The 
exception among the bat beta-CoVs of lineage b is the 
newly discovered SL-CoV WIV1, which is proposed to be 
the likely originator of SARS-CoV [33]; the N gene of this 
virus also has CCA coding for Pro at this position. Thus, 
there is a two-nucleotide difference between the codons in 
the bat CoVs (except SL-CoV WIV1) on the one hand and 
human or civet SARS-CoV on the other (see red dot in 
Fig. 2b). In the orf9b protein, the corresponding codon 
(shifted by +1 in frame) is AAG (coding for Lys) in the bat 
betacoronaviruses of lineage b, and CAG (coding for Gln) 
in the civet and human SARS-CoV sequences as well as in 
SL-CoV WIV1 [33]. Thus, only the N2/9b1 variation 
results in an amino-acid change in the orf9b protein and the 
N1/9b3 nucleotide variation is silent. 
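
The codon pair discussed in this paragraph can be checked by translating the four triplets quoted in the text with the standard genetic code. The helper names below are illustrative and the snippet uses only the codons given above, not the underlying alignment.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(codon):
    """One-letter amino acid for a single DNA or RNA codon."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


def changed_positions(a, b):
    """1-based codon positions at which two codons differ."""
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Codons quoted in the text: bat lineage-b viruses vs. civet/human SARS-CoV.
n_bat, n_sars = "AAA", "CCA"          # N frame, residue 81
orf9b_bat, orf9b_sars = "AAG", "CAG"  # orf9b frame, shifted by +1

print("N:    ", translate(n_bat), "->", translate(n_sars), changed_positions(n_bat, n_sars))
print("orf9b:", translate(orf9b_bat), "->", translate(orf9b_sars), changed_positions(orf9b_bat, orf9b_sars))
```

The N codon differs at positions 1 and 2, whereas the orf9b codon differs only at position 1, which is the N2/9b1 site; the N1/9b3 change falls in the third position of the preceding orf9b codon.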
Fig. 3 a Structures of the SARS-CoV proteins investigated in this 
study, colored according to the B-factor (color scheme used is 
VIBGYOR, where violet depicts the minimum and red depicts the 
maximum value of B-factor) averaged for each amino-acid residue; 
(left) NTD of nucleocapsid protein (overall average B-factor 
11.19 Å², PDB code 2OFZ [58]); (right) dimer of the orf9b protein 
(overall average B-factor 100.8 Å², PDB code 2CME [59]). 
b Disorder prediction result for the NTD of the SARS-CoV 
nucleocapsid protein and its overprinting counterpart, the orf9b 
protein, calculated using the program DisProt VSL2B [67]. The 
degree of order-disorder lies within the range of 0 (well ordered) to 1 
(highly disordered) 


Effect of overlaps on the three-dimensional structures 
of the protein products 


Crystal structures are available for both the N-terminal 
RNA-binding domain (NTD) of the SARS-CoV nucleo- 
capsid protein [58] (PDB code: 2OFZ) and the orf9b pro- 
tein [59] (PDB code: 2CME). The N-terminal 46 residues 
of the NTD have been excluded from the fragment that was 
crystallized, as they are believed to be disordered on the 
basis of secondary structure prediction, limited proteolysis 
experiments, and sequence conservation. The well-ordered 
global NTD (residues 47-175) comprises an antiparallel 
β-sheet core and a β-hairpin protruding from it (Fig. 3a). 
The orf9b protein forms a two-fold symmetric dimer 
comprising two adjacent β-sheets (Fig. 3a). In the central 
hydrophobic cavity between the monomers, electron density 
for a lipid molecule was detected [59]. The global part of 
the nucleocapsid NTD exhibits reduced flexibility, as evident 
from the atomic temperature factors (B-factors) for the 
polypeptide. The average B-factor for the NTD (residues 
47-175) is 11.2 Å² [58]. In contrast, the orf9b protein 
appears to be much more flexible; its average B-factor is 
100.8 Å² [59]. Even though a number of factors contribute 
to the B value, in particular the degree of disorder of the 
crystal, rigid-body movements of the molecules, experi- 
mental errors, etc., it is evident from this huge difference 
between the B values for the two structures that the orf9b 
polypeptide chain is very flexible. In line with this, the 
segments between residues 1-8 and 26-37 in the orf9b 
protein are not visible in the electron-density maps. This 
result supports the notion that overprinting proteins tend to 
show higher flexibility and disorder [1]. In contrast, all 
residues are well defined by electron density in the NTD 
crystal structure of the overlapping N protein. However, it 
should also be mentioned that probably, many residues are 
disordered or highly flexible among the N-terminal 46 
residues of the NTD, which have not been included in the 
construct employed in crystal structure determination but 
belong to the overlapping region (with the exception of the 
first three residues) as seen in a prediction using the pro- 
gram DisProt VSL2B [67] (Fig. 3b). 
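
An overall average B-factor of the kind quoted above can be obtained directly from the ATOM records of a PDB-format file, where the temperature factor occupies columns 61-66 and the chain identifier column 22. The sketch below is an assumption-laden illustration: the function name is hypothetical, only ATOM records (not HETATM) are averaged, and it is not known whether the published averages were computed in exactly this way.

```python
def mean_b_factor(pdb_lines, chain=None):
    """Average temperature factor over ATOM records of a PDB-format file.

    pdb_lines: iterable of text lines (e.g. an open file handle).
    chain:     optional one-letter chain identifier to restrict the average.
    """
    values = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue                       # skip HETATM, REMARK, etc.
        if chain is not None and line[21] != chain:
            continue                       # column 22 holds the chain ID
        values.append(float(line[60:66]))  # columns 61-66 hold the B-factor
    if not values:
        raise ValueError("no matching ATOM records found")
    return sum(values) / len(values)


# Usage with a locally downloaded coordinate file (file name is hypothetical):
# with open("2ofz.pdb") as handle:
#     print(mean_b_factor(handle))
```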

We also made an attempt to locate the amino-acid 
variations, which we identified in the 70 sequences of our 
data set, in the three-dimensional structures of the nucle- 
ocapsid [58] and orf9b [59] proteins. In the overprinted part 
of the nucleocapsid NTD, the majority of mutations occur 
in the unstructured N-terminal region (residues 1-46). 
Residue 81, which is Lys in all sequenced bat betacoro- 
naviruses of lineage b (except SL-CoV WIV1) but Pro in 
all civet and human SARS-CoV strains, is located at 
position 2 of a surface-exposed type-II β-turn of the 
sequence Gly-Pro-Asp-Asp. In the orf9b structure, the 
corresponding residue (Gln in SARS-CoV and SL-CoV 
WIV1) is not defined by electron density and hence part of 
a presumably disordered region. In fact, in the orf9b pro- 
tein, all mutations occur in the regions that have been 
reported to be disordered [59]. Thus, the general observa- 
tion that mutations are more commonly localized in regions 
of no regular secondary structure (such as loops etc.) rather 
than in α-helices or β-strands, remains valid for this 
overprinting protein. 


Discussion 


Previous work has suggested codon usage as a measure to 
determine the relative age of a gene. A discordant rela- 
tionship in the codon usage of a particular gene, when 
compared with the rest of the genes, suggests that the gene 
has evolved recently [1, 25, 28, 62, 63]. It has been shown 
that this phenomenon can be applied to identify over- 
printing viral proteins [63]. Research on overlapping pro- 
tein products in RNA viruses has suggested that proteins of 
the overprinting genes have a tendency to be structurally 
disordered [1]. Taking into account these observations, we 
have analyzed the sequence and structural properties of the 
overlapping proteins nucleocapsid and orf9b in SARS- 
CoV, in the hope of gaining insight into the creation of 
novel proteins in RNA viruses. Studies revealing differential 
selective pressure during the evolution of overlapping 
viral genes [23-27, 62] led us to probe the selection 
pressure acting on the overlapping N and orf9b genes. 

Codon-usage analysis shows that the overprinting orf9b 
has a discordant codon-usage pattern whereas the over- 
printed region of the nucleocapsid gene has a codon-usage 
pattern similar to the non-overlapping regions of the 
SARS-CoV genome. The discordant codon-usage pattern 
suggests a relatively more recent acquisition of the orf9b 
gene [62, 63]. Moreover, no sequence similarity exists 
between the orf9b protein and any other known proteins 
[45, 46]. Among other coronaviruses, internal open reading 
frames within the nucleocapsid gene have been reported for 
members of Betacoronavirus lineage a, for example 
Bovine Coronavirus, Mouse Hepatitis Virus, and human 
coronaviruses HKU1 and OC43 [69-71]. Moreover, they 
can also be found in members of Betacoronavirus lineage 
c, i.e., MERS Coronavirus [36] as well as bat coronaviruses 
HKU4 and HKU5. However, there is little in common 
between the orf9b gene of SARS-CoV and these so-called 
“internal” genes [49]. Hence, we can conclude that orf9b is 
a novel gene. 

A previous study performed on the translation mecha- 
nism employed in orf9b expression describes the presence 
of a sub-optimal Kozak sequence for the N gene [54]. In 
contrast, the start codon of the orf9b gene resides within an 
optimal Kozak sequence, but the properties of the second 
initiation site in leaky ribosomal scanning do not influence 
the decision of the ribosome to stop at or bypass the first 
AUG [9]. Xu et al. [54] showed that the expression level of 
the orf9b gene is relatively weaker when compared to that 
of the N gene. This is probably caused by only a fraction of 
ribosomes bypassing the first AUG, but may also be 
influenced by the fact that the codon usage of orf9b is 
different from that of the rest of the SARS-CoV genes 
(Table 1). In addition, although orf9b production appears 
to depend on leaky ribosomal scanning, we cannot exclude 
that the expression of this gene is modulated by other 
SARS-CoV proteins, as described very recently for mem- 
bers of the family Arteriviridae [21, 22]. The orf9b protein 
has been shown to interact with several other SARS-CoV 
proteins, for example Nsp5, Nsp14, and the orf6 protein 
[52, 72]. It remains to be investigated whether any of these 
can transactivate or suppress the expression of the alter- 
nating gene, orf9b. 

The evolution of overlapping, frame-shifted genes is 
subject to extra constraints. A slower rate of evolution has 
been demonstrated in overlapping genes in a number of 
viruses [7, 23-26]. There is an imposing constraint in 
evolution of overlapping genes due to the fact that a 
favorable or even neutral substitution in one reading frame 
could prove harmful for the other reading frame. Therefore, 
even a synonymous, favorable, or neutral nucleotide sub- 
stitution in one reading frame might be discarded, as it 
could be deleterious in the other reading frame. As a result, 
positive selection of overlapping genes in general is 
severely restricted [7, 23-26]. However, here we demonstrated 
that the overlapping region of the N gene is rather 
evolutionarily conserved when compared to the orf9b gene. 
Orf9b features a higher evolutionary rate that is attained 
mainly via N3/9b2 substitutions. This mechanism of 
independent evolution is similar to the mechanism of 
“independent adaptive selection” recently described for 
the gene encoding the Hepatitis B Virus (HBV) surface 
protein which completely overlaps with the polymerase 
gene [27]. 

A structural comparison of the overlapping NTD of 
the nucleocapsid protein with the orf9b protein indicates 
the latter to be more flexible (Fig. 3). However, 50 % of 
the overlapping orf9b protein corresponds to the struc- 
turally undetermined segment 1-46 of the NTD of the 
nucleocapsid protein. This region is predicted to be 
intrinsically disordered (see Fig. 3b) [73]. Consequently, 
the overprinting orf9b protein is more ordered in its 
N-terminal half than the overprinted N-terminal segment 
1-46 of the nucleocapsid NTD. More studies on over- 
lapping viral proteins with known three-dimensional 
structures would be necessary to refine our understanding 
of the evolutionary and structural constraints caused by 
this phenomenon. 

Currently, the function of the orf9b protein is unknown, 
although its crystal structure [59] suggests that it may bind 
lipids. Since the protein evolved with a high rate and 
independently of the overprinted nucleocapsid protein, it 
may have the potential to acquire new functions in a rel- 
atively short time. The analysis presented here may form a 
basis to follow the future evolution of orf9b and a starting 
point for investigating the so-called “internal genes” 
overlapping with the nucleocapsid gene in other coronav- 
iruses, including MERS-CoV. Beyond its importance for 
understanding coronavirus evolution, our study may have 
implications for the analysis of overlapping reading frames 
in other viruses as well. 